Import the required libraries. We use pandas, seaborn and matplotlib for data visualisation, numpy for statistical computing, and zipfile to extract the .zip file downloaded from Udacity; the % magic commands make plots render inline.
import numpy as np
import pprint
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import zipfile
import os
%matplotlib inline
%config InlineBackend.figure_format = "retina"
# copied from https://stackoverflow.com/questions/21971449/how-do-i-increase-the-cell-width-of-the-jupyter-ipython-notebook-in-my-browser
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
First, extract the zip file, then read in the full dataframe (which is computationally expensive) together with the explanations dictionary. The .zip file is 2.3 GB in size, so uploading it to GitHub is not possible; the dataset can instead be obtained from https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip.
with zipfile.ZipFile(os.path.join(os.getcwd(), "pisa2012.csv.zip")) as zipped:
    zipped.extractall()
pisa_df = pd.read_csv("pisa2012.csv", sep=",", encoding='latin-1')
pisa_dict = pd.read_csv("pisadict2012.csv", sep=",", encoding='latin-1')
Explore the dataset programmatically to identify any tidiness or quality issues we need to address before starting exploratory data analysis, and reduce the dataset to only the samples and variables of interest for further exploration. For data wrangling, create copies of both dataframes, even though this is memory-intensive.
pisa = pisa_df.copy()
pdict = pisa_dict.copy()
pisa.head()
pisa.describe()
pisa.info()
pisa.shape
As handling the dictionary-like style of the pdict dataframe is quite cumbersome, create a dictionary pisadict for rapid lookups between column abbreviations and their respective meanings. This should make cleaning the dataset more comfortable.
pdict.rename(columns={"Unnamed: 0":"key",
"x":"value"}, inplace=True)
pisadict = dict(zip(pdict.key, pdict.value))
pdict.head(10)
len(pisadict)
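The key/value lookup built above can be sketched on a toy two-column frame; the column codes shown are illustrative stand-ins for the real PISA abbreviations.

```python
import pandas as pd

# Minimal sketch of the abbreviation -> meaning lookup: zip the two
# columns of a key/value frame into a plain dict for O(1) lookups.
toy = pd.DataFrame({"key": ["CNT", "ST04Q01"],
                    "value": ["Country code 3-character", "Gender"]})
lookup = dict(zip(toy.key, toy.value))
print(lookup["CNT"])  # -> Country code 3-character
```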
The entire dataframe contains 635 different columns, almost all of which are outside our main focus for this task. Therefore I manually selected column IDs from the index of the pdict dataframe, using NumPy's np.r_ indexer to build an array of column indices to retrieve from the pdict keys.
relevant_keys = np.r_[0,6,7,9:12, 17:20, 26, 31, 32, 37, 43:57,60:66, 140:145, 460, 466, 476]
relevant_keys
columns = pdict.iloc[relevant_keys]["key"].to_list()
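For reference, np.r_ concatenates scalar indices and slices into a single index array, which is what makes the hand-picked column selection above so compact. A tiny self-contained demonstration:

```python
import numpy as np

# np.r_ mixes single positions and slice ranges into one flat array.
idx = np.r_[0, 6, 9:12]
print(idx)  # -> [ 0  6  9 10 11]
```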
The dataset also contains the columns holding students' math, science and reading scores. All of them contain the substring PV (plausible value) and can be extracted with some simple logic.
for key, value in pisadict.items():
    if "PV" in key and "subscale" not in value:
        columns.append(key)
print(columns)
pisa = pisa.loc[:,columns]
pisa.shape
pisa.head(3)
Check for NaN values in our dataset, which now contains only 56 columns (variables), and check the corresponding datatypes.
[print("Col: ", col, "NA: ", pisa[col].isna().sum(), "/////", "Dtype:", pisa[col].dtype) for col in columns];
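A more compact alternative to the per-column print loop, sketched on a toy frame: isna().sum() yields all NA counts at once, and dtypes lists the column types.

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value per column to demonstrate the
# vectorised NA summary; the real call would be pisa.isna().sum().
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
na_counts = toy.isna().sum()
print(na_counts)
print(toy.dtypes)
```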
Where NaN values occur in float-type columns, replace them with the .mean() of the column.

for col in columns:
    if pisa[col].dtype == "float64":
        pisa[col] = pisa[col].fillna(pisa[col].mean())
[print("Col: ", col, "NA: ", pisa[col].isna().sum(), "/////", "Dtype:", pisa[col].dtype) for col in columns];
pisa.duplicated().any()
Rename the column abbreviations to their full names, then shorten them.
for col in pisa.columns:
    pisa.rename(columns={col: pisadict[col]}, inplace=True)
pisa.columns
rename_dict = {'Country code 3-character': "Country_Code",
'Student ID': "Student_ID",
'International Grade': "International_Grade",
'Birth - Month': "Birth_Month",
'Birth -Year': "Birth_Year",
'Gender':"Sex",
'Truancy - Late for School': "Late_For_School",
'Truancy - Skip whole school day': "Skip_Whole_School_Day",
'Truancy - Skip classes within school day': "Skip_Classes_Within_School",
'Mother<Highest Schooling>':"Mother_Highest_Schooling",
'Mother Current Job Status':"Mother_Job_Status",
'Father<Highest Schooling>':"Father_Highest_Schooling",
'Father Current Job Status':"Father_Job_Status",
'Possessions - desk':"Poss_Desk",
'Possessions - own room':"Poss_Own_Room",
'Possessions - study place':"Poss_Study_Place",
'Possessions - computer':"Poss_Computer",
'Possessions - software':"Poss_Software",
'Possessions - Internet':"Poss_Internet",
'Possessions - literature':"Poss_Literature",
'Possessions - poetry':"Poss_Poetry",
'Possessions - art':"Poss_Art",
'Possessions - textbooks':"Poss_Textbooks",
'Possessions - <technical reference books>':"Poss_Technical_Ref_Book",
'Possessions - dictionary':"Poss_Dict",
'Possessions - dishwasher':"Poss_Dishwasher",
'Possessions - <DVD>':"Poss_DVD",
'How many - cellular phones':"Count_Cell_Phone",
'How many - televisions':"Count_TV",
'How many - computers':"Count_Computer",
'How many - cars':"Count_Cars",
'How many - rooms bath or shower':"Count_Bath_Rooms",
'How many books at home':"Count_Books",
'Out-of-School Study Time - Homework':"Homework_Time",
'Out-of-School Study Time - Guided Homework':"Guided_Homework_Time",
'Out-of-School Study Time - Personal Tutor':"Personal_Tutor_Time",
'Out-of-School Study Time - Commercial Company':"Commercial_Company_Time",
'Out-of-School Study Time - With Parent':"Study_Time_with_Parent",
'Learning time (minutes per week) - <test language>':"Learn_Time_Test_Language",
'Learning time (minutes per week)- <Mathematics>':"Learn_Time_Test_Math",
'Learning time (minutes per week) - <Science>':"Learn_Time_Test_Science",
'Plausible value 1 in mathematics':"Value_1_Math",
'Plausible value 2 in mathematics':"Value_2_Math",
'Plausible value 3 in mathematics':"Value_3_Math",
'Plausible value 4 in mathematics':"Value_4_Math",
'Plausible value 5 in mathematics':"Value_5_Math",
'Plausible value 1 in reading':"Value_1_Read",
'Plausible value 2 in reading':"Value_2_Read",
'Plausible value 3 in reading':"Value_3_Read",
'Plausible value 4 in reading':"Value_4_Read",
'Plausible value 5 in reading':"Value_5_Read",
'Plausible value 1 in science':"Value_1_Science",
'Plausible value 2 in science':"Value_2_Science",
'Plausible value 3 in science':"Value_3_Science",
'Plausible value 4 in science':"Value_4_Science",
'Plausible value 5 in science':"Value_5_Science"}
pisa.rename(rename_dict, axis=1, inplace=True)
pisa.info()
To make later wrangling of related groups of columns easier, store the column names matching the categories below in lists.
scores = []
learning_time = []
counts = []
possessions = []
truancy = []
parents = []
base_info = []
for key, value in rename_dict.items():
    if key.startswith("Plausible"):
        scores.append(value)
    elif key.startswith("Learning") or key.startswith("Out-of-School"):
        learning_time.append(value)
    elif key.startswith("How many"):
        counts.append(value)
    elif key.startswith("Possessions"):
        possessions.append(value)
    elif key.startswith("Truancy"):
        truancy.append(value)
    elif key.startswith("Mother") or key.startswith("Father"):
        parents.append(value)
    else:
        base_info.append(value)
Issues:
1. The possessions columns are of type object (string); convert them to categorical.
2. The counts columns contain missing information; as this is not restorable, drop rows with NaN.
3. The counts columns are of type object (string); convert them to ordered (increasing) categoricals.
4. Late_For_School and Skip_Whole_School_Day are of type object (string); convert them to ordered categoricals.
5. The parents columns contain NaN values. These do not appear to mean "None" (which occurs as its own value inside these columns), so we cannot know what they stand for; most likely they were simply not recorded, or they meant something entirely different. In any case, drop them to avoid biasing our results by assuming their meaning.
6. The Sex column is of type object; convert it to categorical.

for pos in possessions:
    pisa[pos] = pisa[pos].astype("category")
print(pisa[counts].isna().any())
for count in counts:
    pisa = pisa.loc[pisa[count].notnull(), :]
print("-------------------------")
print(pisa[counts].isna().any())
The counts columns contain missing information; as this is not restorable, drop rows with NaN. They are also of type object (string), so convert them to ordered (increasing) categoricals.

for count in counts:
    print(count, "--->", pisa[count].unique())
count_order = ["None", "One", "Two", "Three or more"]
count_classes = pd.api.types.CategoricalDtype(categories = count_order, ordered=True)
count_book_order = ['0-10 books ', '11-25 books ', '26-100 books ', '101-200 books ', '201-500 books ', 'More than 500 books']
count_book_classes = pd.api.types.CategoricalDtype(categories = count_book_order, ordered=True)
for count in counts:
    if sorted(count_order) == sorted(pisa[count].unique()):
        pisa[count] = pisa[count].astype(count_classes)
    else:
        pisa[count] = pisa[count].astype(count_book_classes)
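What the ordered CategoricalDtype buys us can be sketched on a toy series: ordered categories support element-wise comparisons and sort by the declared order rather than alphabetically.

```python
import pandas as pd

# Ordered categorical demo with the same count labels as above.
order = ["None", "One", "Two", "Three or more"]
ctype = pd.api.types.CategoricalDtype(categories=order, ordered=True)
s = pd.Series(["Two", "None", "Three or more"]).astype(ctype)
print(s.sort_values().tolist())  # sorts by declared order, not alphabetically
print((s > "One").tolist())      # element-wise ordered comparison
```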
Late_For_School and Skip_Whole_School_Day are of type object (string); convert them to ordered categoricals.

pisa.Late_For_School.unique()
pisa = pisa.loc[pisa.Late_For_School.notnull(),:]
late_for_school_order = list(pisa.Late_For_School.unique())
late_for_school_classes = pd.api.types.CategoricalDtype(categories = late_for_school_order, ordered=True)
pisa["Late_For_School"] = pisa["Late_For_School"].astype(late_for_school_classes)
pisa = pisa.loc[pisa.Skip_Whole_School_Day.notnull(),:]
skip_whole_school_day_order = list(pisa.Skip_Whole_School_Day.unique())
skip_whole_school_day_classes = pd.api.types.CategoricalDtype(categories = skip_whole_school_day_order, ordered=True)
pisa["Skip_Whole_School_Day"] = pisa["Skip_Whole_School_Day"].astype(skip_whole_school_day_classes)
The parents columns contain NaN values. These do not appear to mean "None" (which occurs as its own value inside these columns), so we cannot know what they stand for; most likely they were simply not recorded, or they meant something entirely different. In any case, drop them to avoid biasing our results by assuming their meaning.

pisa = pisa.loc[pisa.Mother_Highest_Schooling.notnull(),:]
pisa = pisa.loc[pisa.Father_Highest_Schooling.notnull(),:]
pisa = pisa.loc[pisa.Mother_Job_Status.notnull(),:]
pisa = pisa.loc[pisa.Father_Job_Status.notnull(),:]
pisa["Mother_Highest_Schooling"].replace('She did not complete <ISCED level 1> ', "<ISCED level 0>", inplace=True)
pisa["Father_Highest_Schooling"].replace('He did not complete <ISCED level 1> ', "<ISCED level 0>", inplace=True)
print(pisa.Mother_Highest_Schooling.unique())
print(pisa.Father_Highest_Schooling.unique())
isced_levels = sorted(pisa.Mother_Highest_Schooling.unique())
isced_classes = pd.api.types.CategoricalDtype(categories=isced_levels, ordered=True)
for par in parents:
    # str.find() returns -1 (truthy) when not found, so test membership instead
    if "Job_Status" in par:
        pisa[par] = pisa[par].astype("category")
    else:
        pisa[par] = pisa[par].astype(isced_classes)
1. Create a "Total_Learning_Time" column that sums up all variables contained in the learning_time list
2. Create average score columns for math, reading and science: "math_score", "read_score" and "science_score", plus an overall "global_score"
Create the Total_Learning_Time column that sums up all variables contained in the learning_time list.

pisa[learning_time].head()
pisa[learning_time] = pisa[learning_time].replace(np.nan, 0)
pisa["Total_Learning_Time"] = sum([pisa[x] for x in learning_time])
pisa["Total_Learning_Time"].head()
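The NaN replacement and row-wise sum above can be written more idiomatically with fillna and sum(axis=1); a sketch on a toy frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

# Fill missing times with 0, then sum across columns per row.
toy = pd.DataFrame({"Homework_Time": [5.0, np.nan],
                    "Personal_Tutor_Time": [2.0, 3.0]})
total = toy.fillna(0).sum(axis=1)
print(total.tolist())  # -> [7.0, 3.0]
```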
Create the average score columns math_score, read_score and science_score, plus an overall global_score.

scores
pisa["global_score"] = sum(pisa[score] for score in scores) / len(scores)
pisa["math_score"] = sum(pisa[score] for score in scores[:5]) / len(scores[:5])
pisa["read_score"] = sum(pisa[score] for score in scores[5:10]) / len(scores[5:10])
pisa["science_score"] = sum(pisa[score] for score in scores[10:]) / len(scores[10:])
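Averaging the plausible-value columns can equivalently be done with DataFrame.mean(axis=1); a toy sketch with two illustrative PV columns:

```python
import pandas as pd

# Row-wise mean over the plausible-value columns of one subject.
toy = pd.DataFrame({"Value_1_Math": [400.0, 500.0],
                    "Value_2_Math": [420.0, 480.0]})
toy["math_score"] = toy[["Value_1_Math", "Value_2_Math"]].mean(axis=1)
print(toy["math_score"].tolist())  # -> [410.0, 490.0]
```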
print(pisa.global_score.head(3))
print("--------------")
print(pisa.math_score.head(3))
print("--------------")
print(pisa.read_score.head(3))
print("--------------")
print(pisa.science_score.head(3))
print(pisa.global_score.describe())
print("----------------------")
print(pisa.math_score.describe())
print("----------------------")
print(pisa.read_score.describe())
print("----------------------")
print(pisa.science_score.describe())
The Sex column is of type object; convert it to categorical.

pisa["Sex"] = pisa.Sex.astype("category")
pisa.describe()
pisa.info()
pisa.head(100)
pisa.dropna(axis=0, inplace=True)
len(pisa.columns)
order = pisa["Country_Code"].value_counts()
plt.style.use("seaborn");
plt.figure(figsize=(10,10));
sns.countplot(y=pisa["Country_Code"], order = order.index, color="red");
plt.savefig("Countries.png")
Key findings
Italy and Mexico are the countries with the most participating students. Overall participation does not seem related to the number of inhabitants of the respective country; Germany, for instance, with its 80 million inhabitants, sits in the lower part of the chart. The USA also appears split into three separate regions (Florida, Massachusetts and Connecticut), which is not particularly intuitive.
pisa.info()
pisa_corr = pisa.copy()
pisa_corr.Sex.replace({"Female":1, "Male":0}, inplace=True)
for pos in possessions:
    pisa_corr[pos].replace({"Yes":1, "No":0}, inplace=True)
for count in counts:
    print(pisa_corr[count].unique())
for count in counts[:-1]:
    pisa[count].replace({"None":0, "One":1, "Two":2, "Three or more": 3}, inplace=True)
plt.figure(figsize=(30,30))
fig = sns.heatmap(pisa_corr.corr(),
cmap="coolwarm", annot=True, fmt=".2f",
linewidths=0.05, linecolor="black");
fig.patch.set_edgecolor("black");
fig.patch.set_linewidth(1)
plt.savefig("Correlation_plot.png")
Key findings
As expected, the score results are highly correlated with each other: students achieving high test results in science also tend to reach good results in math, and so on. Some other obvious trends can be read from this plot as well; for instance, students possessing a computer are highly likely to also have internet access. Though this doesn't appear too relevant for our project here, it helps build confidence that our many data wrangling steps did not introduce errors into the dataset.
plt.style.use("seaborn")
##### reset to default plt style #####
#plt.rcParams.update(plt.rcParamsDefault)
#%matplotlib inline
#%config InlineBackend.figure_format = "retina"
######################################
plt.figure(figsize=(15,12));
plt.subplot(2,2,1);
bins = np.arange(0, pisa.global_score.max()+20,20);
plt.hist(data=pisa, x="global_score", bins=bins, color="lightgreen");
plt.title("Histogram of Global Average Score of all PISA 2012 students");
plt.xlabel("Average Score in Points");
plt.ylabel("Count of Students");
plt.subplot(2,2,2);
bins = np.arange(0, pisa.global_score.max()+20,20);
plt.hist(data=pisa, x="math_score", bins=bins, color="lightblue");
plt.title("Histogram of Math Average Score of all PISA 2012 students");
plt.xlabel("Average Score in Points");
plt.ylabel("Count of Students");
plt.subplot(2,2,3);
bins = np.arange(0, pisa.global_score.max()+20,20);
plt.hist(data=pisa, x="read_score", bins=bins, color="lightblue");
plt.title("Histogram of Reading Average Score of all PISA 2012 students");
plt.xlabel("Average Score in Points");
plt.ylabel("Count of Students");
plt.subplot(2,2,4);
bins = np.arange(0, pisa.global_score.max()+20,20);
plt.hist(data=pisa, x="science_score", bins=bins, color="lightblue");
plt.title("Histogram of Science Average Score of all PISA 2012 students");
plt.xlabel("Average Score in Points");
plt.ylabel("Count of Students");
plt.savefig("Histogram_Scores.png")
Key findings
All test results appear to be normally distributed, centered around 500 points, with no visible skew. This fits the PISA scoring design (scores are scaled toward a mean of about 500) and matches what the central limit theorem (CLT) would suggest for scores that are averages over many item responses.
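The "no skewness" reading can be checked numerically; a sketch on a synthetic sample drawn from the normal distribution the scores visually resemble (mean 500, sd 100), using the standardized third moment as the skewness estimate:

```python
import numpy as np

# Sample skewness of a synthetic normal sample: for a symmetric
# distribution this standardized third moment is close to 0.
rng = np.random.default_rng(42)
sample = rng.normal(loc=500, scale=100, size=5000)
skew = ((sample - sample.mean()) ** 3).mean() / sample.std() ** 3
print(round(skew, 3))  # close to 0 for a symmetric distribution
```

Applying the same computation to pisa.global_score would quantify how symmetric the real distribution is.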
axes = []
fig = plt.figure(figsize=(20,10));
n_rows = 2
n_cols = 4
for idx, col in enumerate(learning_time):
    axes.append(fig.add_subplot(n_rows, n_cols, idx+1));
    data = pisa[col]
    bins = np.arange(0, data.max() + (data.max()/10), 1)
    if col.startswith("Learn"):
        bins = np.arange(0, data.max() + (data.max()/10), 100)
    plt.hist(data, color="lightblue", bins=bins)
    plt.title(col)
    plt.yscale("log")
    plt.xlabel("Time in h")
    plt.ylabel("# of students")
    plt.yticks([10,100,1000,10000,100000,1e6], ["10","100","1k","10k","100k","1m"])
plt.savefig("Study_Time_Histograms.png")
Key findings
As expected, for every category the longer the learning/study time, the fewer students fall into it (watch out for the logarithmic scale on the y-axis).
gender_diffs = pisa.groupby("Sex")["Total_Learning_Time"]
gender_diffs.agg(["mean", "median", "std"])
bins = np.linspace(0, np.max(pisa.Total_Learning_Time)+10,30)
plt.figure(figsize=[4,4])
fig = plt.hist(x = pisa.query("Sex == 'Female'")["Total_Learning_Time"], bins = bins, color="lightgreen", alpha=1, label="Female")
fig = plt.hist(x = pisa.query("Sex == 'Male'")["Total_Learning_Time"], bins = bins, color="lightblue", alpha=1, label="Male")
plt.xlim(0, 2500);
plt.yticks([1e4, 2e4, 3e4, 4e4, 5e4, 6e4, 7e4, 8e4], ["10k","20k","30k","40k","50k","60k","70k", "80k"]);
plt.ylabel("Counts in thousands (1k=1000)");
plt.xlabel("Total Learning Time in hours");
plt.legend(loc="upper right");
plt.title("Learning time separated for boys and girls");
plt.savefig("Learning_Time_Boys_Girls.png")
axes = []
nrows = 2
ncols = 4
fig = plt.figure(figsize=(30,10))
for idx, col in enumerate(learning_time):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    plt.title("Heatmap comparing Global Test Scores with '{}'".format(col.replace("_", " ")))
    plt.hist2d(data=pisa, x=col, y="global_score", cmap="mako_r", cmin=0.8)
    plt.colorbar();
plt.savefig("Heatmap_Learning_Times.png")
Key findings
Female participants tend to be slightly more motivated to spend time learning for school than male participants. To check this for statistical significance we could use either a t-test or bootstrapping.
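The suggested t-test could look like the following sketch, run here on synthetic learning times (the real inputs would be the two Sex groups of pisa["Total_Learning_Time"]); Welch's variant avoids assuming equal variances:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups' total learning times.
rng = np.random.default_rng(0)
girls = rng.normal(loc=520, scale=150, size=1000)
boys = rng.normal(loc=500, scale=150, size=1000)

# Welch's two-sample t-test: a small p-value would indicate a
# statistically significant difference in mean learning time.
t_stat, p_value = stats.ttest_ind(girls, boys, equal_var=False)
print(t_stat, p_value)
```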
fig = plt.figure(figsize=(30,10))
axes = []
nrows = 2
ncols = 7
for idx, col in enumerate(possessions):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    sns.countplot(data=pisa, x=col, hue="Sex", palette=["lightgreen", "lightblue"], order=["Yes","No"])
    plt.yticks([x for x in np.arange(0, 200000, 20000)], ["{}k".format(str(x)[:-3]) if x!=0 else str(x) for x in np.arange(0, 200000, 20000)]);
    plt.title(col.replace("_", " "))
    plt.xlabel(None)
    plt.ylabel("# Counts (1k=1000)")
plt.savefig("Possessions.png")
pisa.groupby("Sex")[possessions].agg("count")
Key findings
Here I used bar plots to count the number of students possessing each item, separated by gender. Girls tend to own more poetry, art and textbooks than boys, while boys tend to own more technical reference books.
counted = pisa.groupby("Sex")[counts]
counted.agg(["mean", "median", "min", "max", "std"])
fig = plt.figure(figsize=(20,6))
axes = []
nrows = 2
ncols = 3
for idx, col in enumerate(counts):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    if col != "Count_Books":
        sns.countplot(data=pisa, x=col, hue="Sex", palette=["lightgreen", "lightblue"], order=[0,1,2,3])
    else:
        sns.countplot(data=pisa, x=col, hue="Sex", palette=["lightgreen", "lightblue"])
    plt.xticks(rotation = 10)
    plt.yticks([x for x in np.arange(0, 200000, 20000)], ["{}k".format(str(x)[:-3]) if x!=0 else str(x) for x in np.arange(0, 200000, 20000)]);
    plt.title(col.replace("_", " "))
    plt.ylabel("# Counts (1k=1000)")
    plt.xlabel(None)
plt.savefig("Counts.png")
pisa.groupby("Count_Bath_Rooms")["global_score"].mean()
fig = plt.figure(figsize=(20,10))
nrows = 2
ncols = 3
axes = []
for idx, col in enumerate(counts):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    sns.pointplot(data=pisa, x=col, y="global_score", hue="Sex", palette=["lightgreen", "lightblue"]);
    plt.ylim([0, 600]);
plt.suptitle("Correlation between score results and number of items")
plt.savefig("Counts_Scores.png")
Key findings
The latter plot is quite interesting: the more bathrooms a student's home has, the better the score results (with a slight decrease for the last category). This can largely be explained by the number of bathrooms being an indicator of wealth; such students likely have better access to education and to resources like computers and expensive school textbooks. The same trend also appears for the number of cars, cell phones and computers at home.
variables = []
variables.extend(counts)
variables.extend(["global_score", "math_score", "science_score", "read_score"])
variables
pisa_subset = pisa.sample(500, replace=True)
g = sns.PairGrid(data = pisa_subset, vars = variables);
g.map_diag(plt.hist, color="lightblue");
g.map_offdiag(plt.scatter, color="lightgreen");
pisa.Count_Books.dtype
pisa["Count_Books_Numeric"] = pisa["Count_Books"].replace({"0-10 books ": np.mean((0,10)),
"11-25 books ": np.mean((11,25)),
"26-100 books ": np.mean((26,100)),
"101-200 books ": np.mean((101,200)),
"201-500 books ": np.mean((201,500)),
"More than 500 books": 501})
variables.append("Count_Books_Numeric")
pisa["Count_Books_Numeric"].head()
palette = sns.color_palette("Blues")
plt.figure(figsize=(10,10))
fig = sns.heatmap(pisa[variables].corr(),
cmap=palette, annot=True, fmt=".2f", vmin=0.2);
fig.patch.set_edgecolor("black");
fig.patch.set_linewidth(1)
plt.title("Correlation heatmap between count of luxury items and global test scores")
plt.xticks(rotation = 20)
plt.savefig("Correlation_plot_possession_scores.png")
sns.barplot(data=pisa, x="Count_Books", y="global_score", palette=["lightgreen", "lightblue"], ci="sd", hue="Sex", errwidth = 0.75);
plt.xlabel(None);
plt.xticks(rotation=20)
plt.ylabel("Global Test Score Result");
plt.title("Comparison between # of Books and Global Test Results separated by Gender");
plt.savefig("Books_Global_Scores.png")
Key findings
Not surprisingly, the number of books at home rises roughly linearly with test scores. We cannot conclude a causal direction from the data alone, but it is plausible that students who own more books acquire more knowledge by reading them and hence achieve better test scores.
plt.figure(figsize=(25,6))
plt.suptitle("Average Score results for number of Books at home separated by Boys and Girls");
plt.subplot(2,2,1);
sns.violinplot(data=pisa, x="Count_Books", y="global_score", hue="Sex", palette=["lightgreen", "lightblue"]);
plt.xlabel(None);
plt.subplot(2,2,2);
sns.violinplot(data=pisa, x="Count_Books", y="math_score", hue="Sex", palette=["lightgreen", "lightblue"]);
plt.xlabel(None);
plt.subplot(2,2,3);
sns.violinplot(data=pisa, x="Count_Books", y="science_score", hue="Sex", palette=["lightgreen", "lightblue"]);
plt.xlabel(None);
plt.subplot(2,2,4);
sns.violinplot(data=pisa, x="Count_Books", y="read_score", hue="Sex", palette=["lightgreen", "lightblue"]);
plt.xlabel(None);
plt.savefig("Avg_scores_hueGender_books.png")
plt.figure(figsize=(8,14))
plt.suptitle("Average Score results for number of Books at home separated by Boys and Girls");
plt.subplot(4,1,1);
sns.pointplot(data=pisa, y="Count_Books", x="global_score", hue="Sex", palette=["lightgreen", "lightblue"], ci="sd", linestyles = "", dodge=True);
plt.xlabel(None);
plt.title("Global Score");
plt.subplot(4,1,2);
sns.pointplot(data=pisa, y="Count_Books", x="math_score", hue="Sex", palette=["lightgreen", "lightblue"], ci="sd", linestyles = "", dodge=True);
plt.xlabel(None);
plt.title("Math Score");
plt.subplot(4,1,3);
sns.pointplot(data=pisa, y="Count_Books", x="science_score", hue="Sex", palette=["lightgreen", "lightblue"], ci="sd", linestyles = "", dodge=True);
plt.xlabel(None);
plt.title("Science Score");
plt.subplot(4,1,4);
sns.pointplot(data=pisa, y="Count_Books", x="read_score", hue="Sex", palette=["lightgreen", "lightblue"], ci="sd", linestyles = "", dodge=True);
plt.xlabel(None);
plt.title("Reading Score");
plt.savefig("Avg_scores_hueGender_books_pointplot.png")
axes = []
nrows = 1
ncols = 3
fig = plt.figure(figsize=(30,5))
for idx, col in enumerate(truancy):
    axes.append(fig.add_subplot(nrows, ncols, idx+1))
    plt.title("Boxplots comparing Global Test Scores with '{}'".format(col.replace("_", " ")))
    sns.boxplot(data=pisa, x=col, y="global_score", hue="Sex", palette=["lightgreen", "lightblue"])
plt.savefig("Truancy_Scores_Boxplot.png")
Key findings
Interestingly, students who are late for school, skip classes or skip whole school days achieve progressively worse results than those who never miss school ("None"). This trend is the same for girls and boys.
grouped = pisa.groupby("Country_Code")[["global_score", "math_score", "read_score", "science_score"]]
grouped.agg(["mean", "median", "std"])
plt.figure(figsize=(40,20));
plt.subplot(1,4,1)
top_ranking = pisa.groupby("Country_Code")["global_score"].median().sort_values(ascending=False).index
plt.title("Boxplots showing overall Test Statistics by countries")
sns.boxplot(data=pisa, y="Country_Code", x="global_score", order=top_ranking, palette="viridis_r");
plt.subplot(1,4,2)
top_ranking = pisa.groupby("Country_Code")["math_score"].median().sort_values(ascending=False).index
plt.title("Boxplots showing math_score Statistics by countries")
sns.boxplot(data=pisa, y="Country_Code", x="math_score", order=top_ranking, palette="mako_r");
plt.subplot(1,4,3)
top_ranking = pisa.groupby("Country_Code")["read_score"].median().sort_values(ascending=False).index
plt.title("Boxplots showing read_score Statistics by countries")
sns.boxplot(data=pisa, y="Country_Code", x="read_score", order=top_ranking, palette="mako_r");
plt.subplot(1,4,4)
top_ranking = pisa.groupby("Country_Code")["science_score"].median().sort_values(ascending=False).index
plt.title("Boxplots showing science_score Statistics by countries")
sns.boxplot(data=pisa, y="Country_Code", x="science_score", order=top_ranking, palette="mako_r");
plt.savefig("Boxplots_Best_Countries.png")
Key findings
Not unexpectedly, the top students come from regions in China. My country (Germany), however, is not as bad as we ourselves tend to assume, being consistently represented around the top 10.
pisa.to_csv("pisa_cleaned_df.csv", index=False)